Reference Distributions#

In this notebook we compute a reference distribution for each catchment in the sample, representing the baseline period-of-record (POR) flow duration curve used for validation.
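For reference, the empirical flow duration curve of a record is simply the sorted flows plotted against their exceedance probabilities. The short sketch below is illustrative only (it is not part of the processing pipeline, and the Weibull plotting positions are an assumption); it shows the quantity the reference distributions are meant to represent.

def empirical_fdc(flows):
    """Return (exceedance probability, flow) pairs for a flow record (illustrative sketch)."""
    q = np.sort(np.asarray(flows, dtype=float))[::-1]      # flows sorted in descending order
    exceedance = np.arange(1, len(q) + 1) / (len(q) + 1)   # Weibull plotting positions
    return exceedance, q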

from multiprocessing import Pool
from functools import partial

import os
import pandas as pd
import json
import numpy as np
import geopandas as gpd
import xarray as xr

from evaluation_metrics import EvaluationMetrics

from scipy.stats import laplace

from pathlib import Path
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
output_notebook()

import data_processing_functions as dpf
BASE_DIR = os.getcwd()

# update this to the path where you stored `HYSETS_2023_update_QC_stations.nc`
HYSETS_DIR = Path('/home/danbot/code/common_data/HYSETS')

Data pre-processing overview#

Note that these steps are optional; their end results are provided in the open data repository.

Get updated data sources and validate catchment attributes#

  1. Extract catchment attributes using updated catchment geometries where available (optional, updated catchment geometries are saved in data/BCUB_watershed_bounds_updated.geojson).

  2. Process climate indices for HYSETS catchments in the study region (optional, pre-processed attributes are contained in BCUB_watershed_attributes_updated.csv).

Import HYSETS catchment attributes#

# import the HYSETS attributes data
hysets_df = pd.read_csv(HYSETS_DIR / 'HYSETS_watershed_properties.txt', sep=';')
# map each Official_ID to its HYSETS Watershed_ID and drainage area (km²)
ws_id_dict = hysets_df.set_index('Official_ID')['Watershed_ID'].to_dict()
da_dict = hysets_df.set_index('Official_ID')['Drainage_Area_km2'].to_dict()
official_id_dict = {row['Official_ID']: row['Watershed_ID'] for _, row in hysets_df.iterrows()}

Import the pre-filtered stations within the study region#

# import the catchment attributes (with runoff statistics) for stations in the BCUB study region
bcub_df = gpd.read_file(os.path.join('data', 'catchment_attributes_with_runoff_stats.csv'))
# station IDs must be strings to preserve leading zeros
bcub_df['official_id'] = bcub_df['official_id'].astype(str)
# map the official station IDs to the HYSETS watershed IDs
bcub_df['watershedID'] = bcub_df['official_id'].apply(lambda x: official_id_dict.get(x, None))
print(f'   Found {len(bcub_df)} catchments in the BCUB region with runoff statistics.')
   Found 1098 catchments in the BCUB region with runoff statistics.

Import streamflow time series#

Note

At the top of data_processing_functions.py, update the STREAMFLOW_DIR variable to match where the HYSETS streamflow time series are stored.
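For example, the relevant line near the top of data_processing_functions.py might look like the following (the path and the use of pathlib.Path are illustrative; point it at whatever location holds the HYSETS netCDF files):

from pathlib import Path

# point this at the directory containing the HYSETS streamflow time series
STREAMFLOW_DIR = Path('/home/danbot/code/common_data/HYSETS')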

# Load dataset
streamflow = xr.open_dataset(HYSETS_DIR / 'HYSETS_2023_update_QC_stations.nc')

# Promote 'watershedID' to a coordinate on 'watershed'
streamflow = streamflow.assign_coords(watershedID=("watershed", streamflow["watershedID"].data))

# Set 'watershedID' as index
streamflow = streamflow.set_index(watershed="watershedID")

# Select only watershedIDs present in bcub_df
valid_ids = [int(wid) for wid in bcub_df['watershedID'].values if wid in streamflow.watershed.values]
ds = streamflow.sel(watershed=valid_ids)
def retrieve_timeseries_discharge(stn):
    watershed_id = official_id_dict[stn]
    da = da_dict[stn]
    try:
        df = ds['discharge'].sel(watershed=str(watershed_id)).to_dataframe(name='discharge').reset_index()
    except KeyError:
        print(f"Warning: Station {stn} not found in dataset under watershedID {watershed_id}.")
        return pd.DataFrame()
    
    df = df.set_index('time')[['discharge']]
    df.dropna(inplace=True)
    # clip the lowest flows to 1e-4 m³/s to avoid taking the log of zero
    df['discharge'] = np.clip(df['discharge'], 1e-4, None)
    df.rename(columns={'discharge': stn}, inplace=True)
    # unit area runoff in L/s/km² and equivalent runoff depth in mm/day
    df[f'{stn}_uar'] = 1000 * df[stn] / da
    df[f'{stn}_mm'] = df[stn] * (24 * 3.6 / da)
    df['log_x'] = np.log(df[f'{stn}_uar'])
    return df

USGS station IDs are numeric strings, and they have the unfortunate characteristic that one station's identifier can be a substring of another's. We have to add a few extra lines of code to ensure we retrieve the correct record.
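As a minimal illustration (the shorter ID below is hypothetical), matching on substrings is ambiguous, whereas comparing the full, zero-padded ID string is not:

candidate_ids = ['02124149', '0212414900']   # one ID is a prefix of the other
target = '02124149'

loose_matches = [c for c in candidate_ids if target in c]   # substring match returns both
exact_matches = [c for c in candidate_ids if c == target]   # exact comparison returns one
assert loose_matches == ['02124149', '0212414900'] and exact_matches == ['02124149']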

# Confirm all watershed IDs exist in ds
bcub_ws_ids = bcub_df['watershedID'].values
ds_ids = ds.watershed.values  # After selection via sel(watershed=valid_ids)

missing = [wid for wid in bcub_ws_ids if wid not in ds_ids]
n_total = len(bcub_ws_ids)
n_missing = len(missing)

assert n_missing == 0, f"{n_missing} / {n_total} watershedIDs missing from dataset. First few missing: {missing[:5]}"

Calculate Period of Record (POR) probability distribution functions for each station record#

In order to compare distributions fairly without data leakage in the subsequent prediction steps, we first define a “global evaluation grid”: the support over which all PDFs will be evaluated. We do this by setting a global expected range of flow on a unit area basis. Since we clipped the lowest flows to \(10^{-4}\) \(m^3/s\) at import, and the minimum drainage area in the dataset is 0.7 \(\text{km}^2\), we can bound the smallest unit area runoff, find the minimum and maximum unit area runoff in the dataset, and leave some margin on either side.
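As a quick check of the lower bound (a small sketch using the values quoted above; the evaluation grid defined below extends well beyond it in both directions):

# lower bound on unit area runoff implied by the clipping floor and the smallest catchment
min_q = 1e-4                     # m³/s, clipping floor applied at import
min_da = 0.7                     # km², smallest drainage area in the dataset
min_uar = 1000 * min_q / min_da  # ≈ 0.14 L/s/km²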

Since we’ll be computing KL divergences, we also assume a prior distribution based on the heavy-tailed Laplace distribution, parameterized by location and scale statistics of log unit area runoff predicted (out of sample) from catchment attributes. This prior is applied to the donor/proxy PMF to avoid division by zero where the probability mass becomes vanishingly small.
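A minimal sketch of the idea (a simplified version of the KDEEstimator / ReferenceDistribution logic implemented below, reusing the laplace distribution and numpy imported above): the Laplace prior is evaluated on the same log grid, converted to a PMF, and blended with the empirical PMF as pseudo-counts so that the donor/proxy distribution has no empty bins when the KL divergence is evaluated.

def blend_with_laplace_prior(pmf, log_grid, log_dx, location, scale, n_obs, prior_strength=0.01):
    """Blend an empirical PMF with a Laplace prior as pseudo-counts (simplified sketch)."""
    prior_pdf = laplace.pdf(log_grid, loc=location, scale=scale)
    prior_pdf /= np.trapezoid(prior_pdf, x=log_grid)   # normalize the prior on the grid
    prior_pmf = prior_pdf * log_dx                     # PDF -> PMF on the log grid
    pseudo_counts = n_obs * (pmf + prior_strength * prior_pmf)
    return pseudo_counts / pseudo_counts.sum()         # posterior PMF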

# import the predicted mean and standard deviation from the previous notebook (Predict Runoff Statistics)
predicted_params = pd.read_csv('data/results/parameter_prediction_results/mean_parameter_predictions.csv', index_col=0)
param_dicts = {
    'mean': predicted_params['log_uar_mean_mean_predicted'].to_dict(),
    'sd': predicted_params['log_uar_std_mean_predicted'].to_dict(),
    'median': predicted_params['log_uar_median_mean_predicted'].to_dict(),
    'mad': predicted_params['log_uar_mad_mean_predicted'].to_dict(),
}

Define a global range of expected (unit area) runoff#

def process_station(stn, da_dict):
    df = retrieve_timeseries_discharge(stn)
    da = da_dict[stn]
    uar = 1000 * df[stn] / da  # L/s/km²
    return stn, uar.min(), uar.max()

def parallel_min_max(bcub_stations, da_dict, processes=None):
    processes = processes or 18
    with Pool(processes=processes) as pool:
        fn = partial(process_station, da_dict=da_dict)
        results = pool.map(fn, bcub_stations)

    global_min = float('inf')
    global_max = float('-inf')
    results = [res for res in results if res[1]]
    for stn, min_uar, max_uar in results:
        if max_uar > global_max:
            global_max = max_uar
            print(f'Max streamflow for {stn}: {max_uar:.2f} L/s/km²')
        if min_uar < global_min:
            global_min = min_uar
            print(f'   Min streamflow for {stn}: {min_uar:.2e} L/s/km² (DA={da_dict[stn]:.2f} km²)')

    print(f"\n Global min = {global_min:.2e} L/s/km², max = {global_max:.2f} L/s/km²")

# Usage
compute_range = False
if compute_range:
    parallel_min_max(bcub_df['official_id'], da_dict)   
def set_grid(global_min, global_max, n_grid_points=2**12):
    # define an evenly spaced grid in log space between the global min and max log unit area runoff
    baseline_log_grid = np.linspace(
        global_min, global_max, n_grid_points
    )
    baseline_lin_grid = np.exp(baseline_log_grid)
    log_dx = np.gradient(baseline_log_grid)
    max_step_size = baseline_lin_grid[-1] - baseline_lin_grid[-2]
    print(f'    max step size = {max_step_size:.1f} L/s/km^2 for n={n_grid_points:.1e} grid points')
    min_step_size = baseline_lin_grid[1] - baseline_lin_grid[0]
    print(f'    min step size = {min_step_size:.2e} L/s/km^2 for n={n_grid_points:.1e} grid points')
    return baseline_lin_grid, baseline_log_grid, log_dx
# note: the maximum discharge in the dataset (ds['discharge'].max()) is roughly 19400 m³/s
baseline_lin_grid, baseline_log_grid, log_dx = set_grid(np.log(1e-7), np.log(1e4), n_grid_points=2**12)
    max step size = 61.7 L/s/km^2 for n=4.1e+03 grid points
    min step size = 6.20e-10 L/s/km^2 for n=4.1e+03 grid points
def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def measurement_error_bandwidth_function(x):
    # piecewise-linear model of relative measurement error as a function of flow magnitude (m³/s)
    error_points = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1., 1e1, 1e2, 1e3, 1e4, 1e5])
    error_values = np.array([1.0, 0.5, 0.2, 0.1, 0.1, 0.1, 0.1, 0.15, 0.2, 0.25])
    return np.interp(x, error_points, error_values, left=1.0, right=0.25)


def adaptive_bandwidths(uar, da):
    flow_data = uar * da / 1000
    unique_q = np.unique(flow_data)
    error_model = measurement_error_bandwidth_function(unique_q)
    unique_UAR = (1000 / da) * unique_q
    upper_err_UAR = unique_UAR * (1 + error_model)
    err_widths_UAR = np.log(upper_err_UAR) - np.log(unique_UAR)

    if len(unique_UAR) < 2:
        noise_bounds = (unique_UAR * (1 - error_model), unique_UAR * (1 + error_model))
        flow_data += np.random.uniform(*noise_bounds, size=flow_data.shape)
        unique_q = np.unique(flow_data)
        unique_UAR = (1000 / da) * unique_q

    log_midpoints = np.log((unique_UAR[:-1] + unique_UAR[1:]) / 2)
    left = unique_UAR[0] - (log_midpoints[0] - unique_UAR[0])
    right = unique_UAR[-1] + (unique_UAR[-1] - log_midpoints[-1])
    log_midpoints = np.concatenate(([left], log_midpoints, [right]))
    log_diffs = np.diff(log_midpoints) / 2 / 1.15

    # use the error-based bandwidth except where the log spacing between
    # unique values is wider, in which case use the spacing-based bandwidth
    bw_vals = np.where(log_diffs > err_widths_UAR, log_diffs, err_widths_UAR)
    idx = np.searchsorted(unique_UAR, uar, side='right') - 1
    return bw_vals[idx]


class KDEEstimator:
    def __init__(self, log_grid, dx, set_global_prior=False):
        self.log_grid = np.asarray(log_grid, dtype=np.float32)
        self.dx = np.asarray(dx, dtype=np.float32)
        self.set_prior_from_laplace_fit = set_global_prior
        self._load_param_dicts()

    def _load_param_dicts(self):
        predicted_params = pd.read_csv('data/results/parameter_prediction_results/mean_parameter_predictions.csv', index_col=0)
        self.param_dicts = {
            'mean': predicted_params['log_uar_mean_mean_predicted'].to_dict(),
            'sd': predicted_params['log_uar_std_mean_predicted'].to_dict(),
            'median': predicted_params['log_uar_median_mean_predicted'].to_dict(),
            'mad': predicted_params['log_uar_mad_mean_predicted'].to_dict(),
        }

    def compute(self, stn, uar_data, da, n_years):
        uar_data = np.asarray(uar_data, dtype=np.float32)
        bw_values = adaptive_bandwidths(uar_data, da)
        log_data = np.log(uar_data)

        H = bw_values[:, None]
        U = (self.log_grid[None, :] - log_data[:, None]) / H
        K = gaussian_kernel(U) / H

        pdf = K.sum(axis=0) / len(log_data)
        pdf /= np.trapezoid(pdf, x=self.log_grid)
        pmf = pdf * self.dx
        pmf /= np.sum(pmf)

        posterior_pmf = pmf
        posterior_pdf = pdf
        if self.set_prior_from_laplace_fit:
            # Compute prior counts from Laplace fit
            # This is a global prior based on the mean and standard deviation of the log-mean
            # and log-standard deviation of the UAR data.
            # The prior is computed across a pre-defined "global" range to avoid data leakage.
            # The prior counts are scaled by the number of observations and the number of columns in the ensemble.
            location, scale = self.param_dicts['median'][stn], self.param_dicts['mad'][stn]
            # note: this branch relies on a module-level compute_prior_from_laplace_fit helper
            # and is only exercised when set_global_prior=True (not the default configuration)
            laplace_prior_counts = compute_prior_from_laplace_fit(stn, location, scale, self.log_grid, self.dx, prior_strength=1.0/10)
            posterior_counts = pmf * len(uar_data) + (1.0 /  n_years) * laplace_prior_counts
            posterior_pmf = posterior_counts / np.sum(posterior_counts)
            posterior_pdf = posterior_pmf / self.dx

        return (posterior_pmf, posterior_pdf)
# set the minimum record length
min_record_years = 5
# choose if you want to set a global prior distribution based on the 
# Laplace fit to (out of sample) predicted location and scale
set_global_prior = False
validated_stations = bcub_df[bcub_df['n_years'].astype(float) >= min_record_years]['official_id'].values
print(f'Filtering stations with at least {min_record_years} complete years of data: {len(validated_stations)}/{len(bcub_df)} stations.')

# there's a problem finding station '0212414900' in the dataset, so drop it
validated_stations = [stn for stn in validated_stations if stn != '0212414900']

# load the complete year dictionary
with open('data/complete_years.json', 'r') as f:
    complete_year_dict = json.load(f)
Filtering stations with at least 5 complete years of data: 1098/1098 stations.
from scipy.stats import wasserstein_distance

class ReferenceDistribution:
    def __init__(self, **kwargs):

        for k, v in kwargs.items():
            setattr(self, k, v)


    def _initialize_station(self, stn):
        self.stn = stn
        self.df = retrieve_timeseries_discharge(stn)
        self.df[f'{stn}_uar'] = 1000 * self.df[stn] / self.da_dict[stn]
        self.da = self.da_dict[stn]
        self.location = self.laplace_prior_param_dicts['median'][stn]
        self.scale = self.laplace_prior_param_dicts['mad'][stn]
        self.n_observations = len(self.df.dropna())
        self.linear_grid = np.exp(self.baseline_log_grid)
        self.EvalMetrics = EvaluationMetrics(self.baseline_log_grid, self.log_dx)
        

    def _compute_prior_from_laplace_fit(self, location, scale):
        """
        Fit a Laplace distribution to the simulation and define a 
        pdf across a pre-determined "global" range to avoid data
        leakage.  "Normalize" by setting the total prior mass to
        integrate to a factor related to the number of observations.

        The location of the Laplace distribution is the median, 
        and the scale is the mean absolute deviation (MAD).
        By Jensen's Inequality, the MAD is less than or equal to the standard deviation.
        Here we just use the predicted log-mean and log-standard deviation
        as an approximation of the Laplace distribution parameters.
        """
        prior_pdf = laplace.pdf(self.baseline_log_grid, loc=location, scale=scale)
        prior_check = np.trapezoid(prior_pdf, x=self.baseline_log_grid)
        prior_pdf /= prior_check
        assert np.min(prior_pdf) > 0, f'min prior == 0, scale={scale:.5f}'
        # convert prior PDF to PMF (pseudo-count mass function)
        return prior_pdf * self.log_dx
        
    
    def _compute_posterior_with_laplace_prior(self, kde_pmf):
        """Compute the posterior distribution using a Laplace prior.
        Ensure that the prior is not too strong by checking the
        evaluation metrics against the specified thresholds.
        If the prior has too much influence in terms of any of the metrics tested, 
        reduce the prior strength and recompute."""
        prior_pmf = self._compute_prior_from_laplace_fit(self.location, self.scale)
        # convert the pdf to counts and apply the prior
        pseudo_counts = self.n_observations * (kde_pmf + prior_pmf * self.prior_strength)

        # re-normalize the pmf
        posterior_pmf = pseudo_counts / np.sum(pseudo_counts)
        posterior_pdf = posterior_pmf / self.log_dx

        pdf_check = np.trapezoid(posterior_pdf, x=self.baseline_log_grid)
        posterior_pdf /= pdf_check
        
        # we are testing the prior influence, which means the KDE pmf is
        # the "baseline_pmf" and the posterior_pmf is the "pmf_est";
        # this tells us how far the prior has shifted the posterior from the baseline
        metrics = self.EvalMetrics._evaluate_fdc_metrics_from_pmf(posterior_pmf, kde_pmf)
        
        # check nse and kge for values LESS THAN the specified threshold
        less_than_metrics = [metrics[k] < self.EvalMetrics.metric_limits[k] for k in ['nse', 'kge']]
        greater_than_metrics = [metrics[k] > self.EvalMetrics.metric_limits[k] for k in ['rmse', 'relative_error', 'kld', 'emd']]
        combined_thresholds = greater_than_metrics + less_than_metrics

        if np.any(combined_thresholds):
            # the prior has shifted the posterior too far from the KDE baseline:
            # halve the prior strength and recompute
            self.prior_strength *= 0.5
            return self._compute_posterior_with_laplace_prior(kde_pmf)
        
        return posterior_pdf, posterior_pmf
    
    
    def _compute_kde(self):
        self.kde = KDEEstimator(self.baseline_log_grid, self.log_dx)
        complete_years = self.complete_year_dict[self.stn]
        # use the instance's drainage area rather than the module-level da_dict
        pmf, pdf = self.kde.compute(self.stn, self.df[f'{self.stn}_uar'].values, self.da, complete_years)
        return pmf, pdf
def compute_baseline_distribution_by_kde(inputs):
    stn, input_config = inputs
    baseline_distribution = ReferenceDistribution(**input_config)
    baseline_distribution._initialize_station(stn)
    pmf, _ = baseline_distribution._compute_kde()
    pdf_posterior, pmf_posterior = baseline_distribution._compute_posterior_with_laplace_prior(pmf)
    return (stn, pmf_posterior, pdf_posterior)
pdf_path = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions' / 'bcub_pdfs.csv'
pmf_path = Path(os.getcwd()) / 'data' / 'results' / 'baseline_distributions' / 'bcub_pmfs.csv'

shared_config = {
    'baseline_log_grid': baseline_log_grid, 
    'da_dict': da_dict,
    'log_dx': log_dx,
    'complete_year_dict': complete_year_dict,
    'laplace_prior_param_dicts': param_dicts,
    'prior_strength': 0.01,
    }

if os.path.exists(pdf_path):
    pmf_df = pd.read_csv(pmf_path, index_col='q')
    pdf_df = pd.read_csv(pdf_path, index_col='q')
else:
    # compute the PDF and PMF for each station    
    inputs = [(stn, shared_config) for stn in validated_stations]    
    print(len(validated_stations), 'stations meeting minimum complete period of record.')
    
    with Pool() as pool:
        results = pool.map(compute_baseline_distribution_by_kde, inputs)
        

    # concatenate the results
    stations, pmfs, pdfs = zip(*results)
    pdf_df = pd.DataFrame(np.stack(pdfs, axis=1), columns=stations)
    pmf_df = pd.DataFrame(np.stack(pmfs, axis=1), columns=stations)

    pdf_df['q'] = baseline_lin_grid
    pmf_df['q'] = baseline_lin_grid

    # save the pdf and pmf files
    pdf_df.set_index('q', inplace=True)
    pmf_df.set_index('q', inplace=True)
    pdf_df.to_csv(pdf_path)
    pmf_df.to_csv(pmf_path)

Compute the entropy of each PMF#

Note

The remainder of this notebook is not part of the analysis presented in the accompanying paper.

The entropy of the distribution represents the randomness or disorder of the system, and it is given by the expected negative log probability of the system states:

\[H(X) = -\sum_{i=1}^N p_i \log p_i\]
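For a discrete PMF this can be computed directly from the bin probabilities; a minimal sketch (bins with zero probability contribute nothing):

def pmf_entropy_bits(pmf):
    """Shannon entropy (in bits) of a discrete PMF."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

# a uniform PMF over 2**8 states attains the maximum entropy of 8 bits
assert np.isclose(pmf_entropy_bits(np.full(256, 1 / 256)), 8.0)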
# compute the entropy of the posterior distribution for each station

bit_range = list(range(2, 13)) # set a range that is both too low and too high for the data
entropy_output_folder = Path(os.getcwd()) / 'data' / 'results' / 'entropy_results'
if not entropy_output_folder.exists():
    entropy_output_folder.mkdir(parents=True, exist_ok=True)

states = [2**b for b in bit_range]
for s in states:
    q_values = pmf_df.index.values
    # resample the PMF onto s log-spaced bin centres spanning the same range of q
    pmf_resampled = pd.DataFrame(
        index=np.exp(np.linspace(np.log(q_values.min()),
                                 np.log(q_values.max()),
                                 s)),
        columns=pmf_df.columns,
        dtype=float
    )
    for stn in pmf_df.columns:
        pmf = pmf_df[stn].values
        # assign each fine-grid bin to a coarse bin and sum its probability mass
        bin_numbers = np.digitize(q_values, pmf_resampled.index.values)
        tmp = pd.DataFrame({'pmf': pmf, 'bin': bin_numbers})
        agg = tmp.groupby('bin')['pmf'].sum()
        full_pmf = agg.reindex(range(1, s+1), fill_value=0)
        pmf_resampled[stn] = full_pmf.values
        assert np.isclose(full_pmf.sum(), 1), f'PMF for {stn} does not sum to 1: {np.sum(full_pmf):.6f}'

    bits = int(np.log2(s))
    pmf_resampled.index.name = 'q'
    pmf_resampled.to_csv(entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv')
from bokeh.palettes import RdYlGn

def logspace_edges_from_linear_centers(centers):
    """Given linear-space bin centers from a log-uniform KDE, return log-space edges in linear space."""
    log_centers = np.log(centers)
    dlog = log_centers[1] - log_centers[0]
    log_edges = np.concatenate([
        [log_centers[0] - dlog / 2],
        log_centers + dlog / 2
    ])
    return np.exp(log_edges)  # return edges in linear space


pdf_fig = figure(title="BCUB PDF Resampled", width=750, height=400, x_axis_type='log')
entropy_distributions = {}

for s in states:
    bits = int(np.log2(s))
    pmf_path = entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv'
    pmf_resampled = pd.read_csv(pmf_path, index_col=0)

    pmf = np.percentile(pmf_resampled, 50, axis=1)  # median PMF across stations

    # Compute bin edges and widths from linear bin centers
    centers = pmf_resampled.index.astype(float).values
    edges = logspace_edges_from_linear_centers(centers)
    dx = np.diff(edges)

    # Convert PMF to PDF and normalize
    pdf = pmf / dx
    x_centers = (edges[:-1] + edges[1:]) / 2
    pdf /= np.trapezoid(pdf, x=x_centers)  # ensure integral = 1

    # Entropy of PMF (still valid for info content calc)
    mask = pmf > 0
    entropy = -np.sum(pmf[mask] * np.log2(pmf[mask]))
    ratio = entropy / bits

    # Plot PDF
    pdf_fig.quad(
        top=pdf, bottom=0,
        left=edges[:-1], right=edges[1:],
        fill_color=RdYlGn[11][states.index(s) % len(RdYlGn[11])],
        line_color=None, alpha=0.7,
        legend_label=f'{bits:.0f}b (H={entropy:.1f}, {100*ratio:.0f}%)'
    )

# Final formatting
pdf_fig.legend.click_policy = 'hide'
pdf_fig.legend.location = 'top_right'
pdf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pdf_fig.yaxis.axis_label = "Probability Density"
pdf_fig = dpf.format_fig_fonts(pdf_fig, font_size=14)

show(pdf_fig)
def compute_empirical_cdf(values):
    """Compute the empirical cumulative distribution function (CDF) of the given values."""
    sorted_values = np.sort(values)
    cdf = np.arange(1, len(sorted_values) + 1) / len(sorted_values)
    return sorted_values, cdf
from bokeh.palettes import Viridis10
# plot the CDFs of entropy by number of bits
entropy_dist_fig = figure(title="Entropy Distributions by Number of Bits", width=750, height=400)
colours = Viridis10
for s in states:
    bits = int(np.log2(s))
    pmf_resampled = pd.read_csv(entropy_output_folder / f'bcub_pmf_resampled_{bits}bits.csv', index_col=0)
    # compute the entropy of each station's PMF (one column per station),
    # guarding against log(0) in empty bins
    pmf = pmf_resampled.values
    plogp = np.zeros_like(pmf)
    mask = pmf > 0
    plogp[mask] = pmf[mask] * np.log2(pmf[mask])
    sample_entropy = -np.sum(plogp, axis=0)
    mean_entropy = np.mean(sample_entropy)
    x, cdf = compute_empirical_cdf(sample_entropy)

    entropy_dist_fig.line(x, cdf, line_width=3, color=colours[states.index(s) % len(colours)],
                          legend_label=f'{bits:.0f}b (H={mean_entropy:.1f})')
entropy_dist_fig.legend.location = 'bottom_right'
entropy_dist_fig.legend.background_fill_alpha = 0.5
entropy_dist_fig.xaxis.axis_label = "Entropy (bits)"
entropy_dist_fig.yaxis.axis_label = "Cumulative Probability"
entropy_dist_fig.legend.click_policy = 'hide'
entropy_dist_fig = dpf.format_fig_fonts(entropy_dist_fig, font_size=14)
show(entropy_dist_fig)
from bokeh.palettes import Viridis10 as colours

pmf_fig = figure(title="PDF by Percentile", width=1000, height=400, x_axis_type='log')
percentiles = 2, 10, 25, 50, 75, 90, 95, 96, 98, 99

# use the 8-bit (256 state) resampled PMFs
pmf_resampled = pd.read_csv(entropy_output_folder / 'bcub_pmf_resampled_8bits.csv', index_col=0)
n_bits = int(np.log2(len(pmf_resampled)))
linear_centers = pmf_resampled.index.astype(float).values
edges = logspace_edges_from_linear_centers(linear_centers)
dx = np.diff(edges)  # linear bin widths corresponding to log-space bins

for pct in percentiles:
    # take the pct-th percentile of probability mass in each bin across stations
    pmf = np.percentile(pmf_resampled, pct, axis=1)

    pdf = pmf / dx  # convert PMF to PDF
    # normalize the PDF to have unit area
    area = np.trapezoid(pdf, x=edges[:-1])
    pdf = pdf / area
    area = np.trapezoid(pdf, x=edges[:-1])
    assert np.isclose(area, 1), f'Area under PDF for {pct}th percentile does not equal 1: {area:.6f}'

    # entropy (bits) of the percentile PMF, guarding against log(0)
    mask = pmf > 0
    entropy = -np.sum(pmf[mask] * np.log2(pmf[mask]))
    ratio = entropy / n_bits

    pmf_fig.line(
        x=edges[:-1],
        y=pdf,
        line_width=4,
        color=colours[percentiles.index(pct) % len(colours)],
        legend_label=f'{pct}th Percentile (H={entropy:.1f}, {100*ratio:.0f}%)'
    )

pmf_fig.legend.click_policy = 'hide'
pmf_fig.legend.background_fill_alpha = 0.5
pmf_fig.add_layout(pmf_fig.legend[0], 'right')
pmf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
pmf_fig.yaxis.axis_label = "Probability Density"
pmf_fig = dpf.format_fig_fonts(pmf_fig, font_size=14)
show(pmf_fig)
from bokeh.models import LogColorMapper, ColorBar, FixedTicker, ColumnDataSource, LinearColorMapper
from bokeh.palettes import Viridis256

percentiles = np.linspace(0.01, 99.9, 500)
pmf_resampled = pd.read_csv(entropy_output_folder / 'bcub_pmf_resampled_11bits.csv', index_col=0)
assert np.allclose(pmf_resampled.sum(), 1), "PMF does not sum to 1 across all stations."
linear_centers = pmf_resampled.index.astype(float).values
edges = logspace_edges_from_linear_centers(linear_centers)
dx = np.diff(edges)
x_vals = edges[:-1]

z_matrix = []
for pct in percentiles:
    pmf = np.percentile(pmf_resampled, pct, axis=1)
    # pmf /= np.sum(pmf)
    z_matrix.append(pmf)

z_image = np.flipud(np.array(z_matrix))

# Use log color mapping for dynamic range of PDF values
mapper = LogColorMapper(palette=Viridis256, low=z_image[z_image > 0].min(), high=z_image.max())
# Create figure with a log-scaled x-axis
p = figure(
    title="PMF across Large Sample of Watersheds",
    x_range=(x_vals.min(), x_vals.max()),
    y_range=(0.01, 99.9),
    width=1000,
    height=500,
    x_axis_type="log",
)

p.image(
    image=[z_image],
    x=x_vals.min(),
    y=1,
    dw=x_vals.max() - x_vals.min(),
    dh=98,
    color_mapper=mapper
)

# Axis labels
p.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
p.yaxis.axis_label = "Percentile"

# Color bar
color_bar = ColorBar(title="Probability Mass",  color_mapper=mapper, label_standoff=12)
p.add_layout(color_bar, 'right')

show(p)
mean_pmf = pmf_resampled.mean(axis=1)  # mean PMF per bin across stations

# Create a new figure for the mean PMF
mean_pmf_fig = figure(title="Mean PMF of Unit Area Runoff", width=1000, height=400, x_axis_type='log')
mean_pmf_fig.line(x=x_vals, y=mean_pmf, line_width=2, color='blue', legend_label='Mean PMF')
mean_pmf_fig.xaxis.axis_label = "Unit Area Runoff (L/s/km²)"
mean_pmf_fig.yaxis.axis_label = "Probability Mass"
mean_pmf_fig.legend.location = 'top_right'
mean_pmf_fig.legend.click_policy = 'hide'
mean_pmf_fig = dpf.format_fig_fonts(mean_pmf_fig, font_size=14)
show(mean_pmf_fig)

Next steps#

In the subsequent chapters, we’ll use the KL divergence computed for each station in a gradient boosting model to test how well it can be predicted from catchment attributes.

Citations#